This dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.
The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.
Dataset can be found at https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot
import pandas_bokeh
import plotly.express as px
from plotly.subplots import make_subplots
import random
from tabulate import tabulate
We define a seed to have reproducible results.
np.random.seed(77877)
We set the bokeh output to notebook
We define a permanent style for the plot in order to have an homogenous vizualisation
plt.style.use('default')
from bokeh.io import curdoc
curdoc().theme = 'dark_minimal'
All bokeh and plotly graphs are dynamic, so you can move the mouse over an element to see which data he refers to.
You can also interact with the legend of the plot to show/hide some categories
%matplotlib inline
We set the maximum number of columns allowed to be displayed to 100 so that the all dataframe can be printed
pd.set_option('display.max_columns', 100)
We create a dataframe by reading a csv file and importing the data
diabetes_df = pd.read_csv(r'../Dataset/diabetic_data.csv', sep =',', na_values="?", low_memory = False)
diabetes_df.head()
| encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | NaN | 6 | 25 | 1 | 1 | NaN | Pediatrics-Endocrinology | 41 | 0 | 1 | 0 | 0 | 0 | 250.83 | NaN | NaN | 1 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | NaN | 1 | 1 | 7 | 3 | NaN | NaN | 59 | 0 | 18 | 0 | 0 | 0 | 276 | 250.01 | 255 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | NaN | 1 | 1 | 7 | 2 | NaN | NaN | 11 | 5 | 13 | 2 | 0 | 1 | 648 | 250 | V27 | 6 | None | None | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | NaN | 1 | 1 | 7 | 2 | NaN | NaN | 44 | 1 | 16 | 0 | 0 | 0 | 8 | 250.43 | 403 | 7 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | NaN | 1 | 1 | 7 | 1 | NaN | NaN | 51 | 0 | 8 | 0 | 0 | 0 | 197 | 157 | 250 | 5 | None | None | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | Ch | Yes | NO |
diabetes_df.tail()
| encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101761 | 443847548 | 100162476 | AfricanAmerican | Male | [70-80) | NaN | 1 | 3 | 7 | 3 | MC | NaN | 51 | 0 | 16 | 0 | 0 | 0 | 250.13 | 291 | 458 | 9 | None | >8 | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Down | No | No | No | No | No | Ch | Yes | >30 |
| 101762 | 443847782 | 74694222 | AfricanAmerican | Female | [80-90) | NaN | 1 | 4 | 5 | 5 | MC | NaN | 33 | 3 | 18 | 0 | 0 | 1 | 560 | 276 | 787 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | No | Yes | NO |
| 101763 | 443854148 | 41088789 | Caucasian | Male | [70-80) | NaN | 1 | 1 | 7 | 1 | MC | NaN | 53 | 0 | 9 | 1 | 0 | 0 | 38 | 590 | 296 | 13 | None | None | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Down | No | No | No | No | No | Ch | Yes | NO |
| 101764 | 443857166 | 31693671 | Caucasian | Female | [80-90) | NaN | 2 | 3 | 7 | 10 | MC | Surgery-General | 45 | 2 | 21 | 0 | 0 | 1 | 996 | 285 | 998 | 9 | None | None | No | No | No | No | No | No | Steady | No | No | Steady | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 101765 | 443867222 | 175429310 | Caucasian | Male | [70-80) | NaN | 1 | 1 | 7 | 6 | NaN | NaN | 13 | 3 | 3 | 0 | 0 | 0 | 530 | 530 | 787 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | NO |
By fisrt obersving the data, we can see that there are a lot of values in this dataset. There are 101 766 row and 50 columns. We will certainly have to analyze and clean up these data a little to reduce them and make them more treatable.
Let's use the info() method in order to observe the structure and the type of the dataframe
diabetes_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101766 entries, 0 to 101765 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 encounter_id 101766 non-null int64 1 patient_nbr 101766 non-null int64 2 race 99493 non-null object 3 gender 101766 non-null object 4 age 101766 non-null object 5 weight 3197 non-null object 6 admission_type_id 101766 non-null int64 7 discharge_disposition_id 101766 non-null int64 8 admission_source_id 101766 non-null int64 9 time_in_hospital 101766 non-null int64 10 payer_code 61510 non-null object 11 medical_specialty 51817 non-null object 12 num_lab_procedures 101766 non-null int64 13 num_procedures 101766 non-null int64 14 num_medications 101766 non-null int64 15 number_outpatient 101766 non-null int64 16 number_emergency 101766 non-null int64 17 number_inpatient 101766 non-null int64 18 diag_1 101745 non-null object 19 diag_2 101408 non-null object 20 diag_3 100343 non-null object 21 number_diagnoses 101766 non-null int64 22 max_glu_serum 101766 non-null object 23 A1Cresult 101766 non-null object 24 metformin 101766 non-null object 25 repaglinide 101766 non-null object 26 nateglinide 101766 non-null object 27 chlorpropamide 101766 non-null object 28 glimepiride 101766 non-null object 29 acetohexamide 101766 non-null object 30 glipizide 101766 non-null object 31 glyburide 101766 non-null object 32 tolbutamide 101766 non-null object 33 pioglitazone 101766 non-null object 34 rosiglitazone 101766 non-null object 35 acarbose 101766 non-null object 36 miglitol 101766 non-null object 37 troglitazone 101766 non-null object 38 tolazamide 101766 non-null object 39 examide 101766 non-null object 40 citoglipton 101766 non-null object 41 insulin 101766 non-null object 42 glyburide-metformin 101766 non-null object 43 glipizide-metformin 101766 non-null object 44 glimepiride-pioglitazone 101766 non-null object 45 metformin-rosiglitazone 101766 non-null object 46 metformin-pioglitazone 101766 non-null object 47 change 101766 non-null object 48 diabetesMed 101766 non-null object 49 readmitted 101766 non-null object dtypes: int64(13), object(37) memory usage: 38.8+ MB
diabetes_df.describe().apply(lambda s: s.apply('{0:.2f}'.format))
| encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 | 101766.00 |
| mean | 165201645.62 | 54330400.69 | 2.02 | 3.72 | 5.75 | 4.40 | 43.10 | 1.34 | 16.02 | 0.37 | 0.20 | 0.64 | 7.42 |
| std | 102640295.98 | 38696359.35 | 1.45 | 5.28 | 4.06 | 2.99 | 19.67 | 1.71 | 8.13 | 1.27 | 0.93 | 1.26 | 1.93 |
| min | 12522.00 | 135.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 25% | 84961194.00 | 23413221.00 | 1.00 | 1.00 | 1.00 | 2.00 | 31.00 | 0.00 | 10.00 | 0.00 | 0.00 | 0.00 | 6.00 |
| 50% | 152388987.00 | 45505143.00 | 1.00 | 1.00 | 7.00 | 4.00 | 44.00 | 1.00 | 15.00 | 0.00 | 0.00 | 0.00 | 8.00 |
| 75% | 230270887.50 | 87545949.75 | 3.00 | 4.00 | 7.00 | 6.00 | 57.00 | 2.00 | 20.00 | 0.00 | 0.00 | 1.00 | 9.00 |
| max | 443867222.00 | 189502619.00 | 8.00 | 28.00 | 25.00 | 14.00 | 132.00 | 6.00 | 81.00 | 42.00 | 76.00 | 21.00 | 16.00 |
Now we want to know and understand the signification of the differents columns in order to analyse them. With this dataset, we have a pdf document explaining some characteristics of these data:
Here, we will use the pandas_profiling library to call the profil_report() function. This library extends the DataFrame for a quick analysis
This report contains several informations such as the types of columns in the dataframe, the uniques and missing values, it will do some quantile and descriptive statistics, will search some correlations between the differents features. We will use it as a base to clean up our data.
report = ProfileReport(diabetes_df, title = "Diabetes dataset overview", dark_mode = True)
report